Abstract

The inequality of Vapnik and Chervonenkis controls the expectation of the function 
by its sample average uniformly over a VC-major class of functions taking into account 
the size of the expectation. Using Talagrand's kernel method we prove a similar result 
for the classes of functions for which Dudley's uniform entropy integral or bracketing 
entropy integral is finite. 



1 Introduction and main results. 

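Let $\Omega$ be a measurable space with a probability measure $P$, and let $\Omega^n$ be the product space with the product measure $P^n$. Consider a family of measurable functions $\mathcal{F} = \{f : \Omega \to [0,1]\}$. Denote

\[ Pf = \int f\,dP, \qquad \bar{f} = \frac{1}{n}\sum_{i=1}^{n} f(x_i), \qquad x = (x_1,\dots,x_n) \in \Omega^n. \]
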

The main purpose of this paper is to provide probabilistic bounds for $Pf$ in terms of $\bar{f}$ and the complexity assumptions on the class $\mathcal{F}$. We are trying to extend the following result of Vapnik and Chervonenkis ([Hj). Let $\mathcal{C}$ be a class of sets in $\Omega$. Let

\[ S(n) = \max_{x \in \Omega^n} \bigl|\{ \{x_1,\dots,x_n\} \cap C : C \in \mathcal{C} \}\bigr|. \]


*The work was done during summer internship at AT&T Research Labs at the Laboratory of Speech and 
Image Processing. 
The VC dimension $d$ of the class $\mathcal{C}$ is defined as

\[ d = \inf\{ j \ge 1 : S(j) < 2^j \}. \]

$\mathcal{C}$ is called VC if $d < \infty$. The class of functions $\mathcal{F}$ is called VC-major if the class of sets

\[ \mathcal{C} = \bigl\{ \{x \in \Omega : f(x) \le t\} : f \in \mathcal{F},\ t \in \mathbb{R} \bigr\} \]

is a VC class of sets in $\Omega$, and the VC dimension of $\mathcal{F}$ is defined as the VC dimension of $\mathcal{C}$. The inequality of Vapnik and Chervonenkis states that (see Theorem 5.3 in [TH]) if $\mathcal{F}$ is a VC-major class of $[0,1]$-valued functions with dimension $d$, then for all $\delta > 0$, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$



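\[ \frac{1}{n(Pf)^{1/2}} \sum_{i=1}^{n} (Pf - f(x_i)) \le 2\Bigl(\frac{1}{n}\log S(2n) + \frac{1}{n}\log\frac{4}{\delta}\Bigr)^{1/2}, \tag{1.1} \]

where for $n \ge d$, $S(n)$ can be bounded by

\[ S(n) \le \Bigl(\frac{en}{d}\Bigr)^{d} \]

(see [16]) to give

\[ \frac{1}{n(Pf)^{1/2}} \sum_{i=1}^{n} (Pf - f(x_i)) \le 2\Bigl(\frac{d}{n}\log\frac{2en}{d} + \frac{1}{n}\log\frac{4}{\delta}\Bigr)^{1/2}. \tag{1.2} \]
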



The factor $(Pf)^{-1/2}$ allows interpolation between the $n^{-1}$ rate for $Pf$ in the optimistic zero-error case $\bar{f} = 0$ and the $n^{-1/2}$ rate in the pessimistic case when $\bar{f}$ is "large". In this paper we will prove a bound of a similar nature under different assumptions on the complexity of the class $\mathcal{F}$. Using Talagrand's abstract concentration inequality in product spaces and the related kernel method for empirical processes [13], we will first prove a general result that interpolates between the optimistic and pessimistic cases. Then we will give examples of application of this general result in two situations: when Dudley's uniform entropy integral is finite, and when the bracketing entropy integral is finite.
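
The interpolation can be made concrete with a small numerical sketch. Writing the Vapnik-Chervonenkis bound as $Pf - \bar{f} \le 2\varepsilon (Pf)^{1/2}$ with $\varepsilon^2 = (d/n)\log(2en/d) + (1/n)\log(4/\delta)$ and solving the quadratic inequality in $(Pf)^{1/2}$ gives $Pf \le (\varepsilon + (\varepsilon^2 + \bar{f})^{1/2})^2$. The function names and the values of $n$, $d$, $\delta$ below are illustrative assumptions and do not come from the paper.

```python
import math

def vc_epsilon(n, d, delta):
    """epsilon with epsilon^2 = (d/n) log(2en/d) + (1/n) log(4/delta); meant for n >= d."""
    return math.sqrt(d / n * math.log(2 * math.e * n / d) + math.log(4 / delta) / n)

def upper_bound_on_Pf(f_bar, n, d, delta):
    """Largest p satisfying p - f_bar <= 2 * epsilon * sqrt(p)."""
    eps = vc_epsilon(n, d, delta)
    return (eps + math.sqrt(eps ** 2 + f_bar)) ** 2

# Hypothetical sample sizes: compare the zero-error case with a "large" f_bar.
for n in (10**3, 10**4, 10**5):
    zero_error = upper_bound_on_Pf(0.0, n, d=10, delta=0.05)
    gap_large = upper_bound_on_Pf(0.25, n, d=10, delta=0.05) - 0.25
    print(n, zero_error, gap_large)
```

In the zero-error case the bound equals $4\varepsilon^2$, which decays like $n^{-1}$ up to a logarithm, while for a fixed positive $\bar{f}$ the gap $Pf - \bar{f}$ decays only like $n^{-1/2}$.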

Let us formulate Talagrand's concentration inequality that is used in the proof of our main Theorem 2 below. Consider a probability measure $\nu$ on $\Omega^n$ and $x \in \Omega^n$. We will denote by $x_i$ the $i$th coordinate of $x$. If $C_i = \{y \in \Omega^n : y_i \ne x_i\}$, we consider the image of the restriction of $\nu$ to $C_i$ by the map $y \to y_i$, and its Radon-Nikodym derivative $d_i$ with respect to $P$. As in [14] we assume that $\Omega$ is finite and each point is measurable with a positive measure. Let $m$ be the number of atoms in $\Omega$ and $p_1,\dots,p_m$ be their probabilities. By the definition of $d_i$ we have

\[ \int_{C_i} g(y_i)\,d\nu(y) = \int_{\Omega} g(y_i)\,d_i(y_i)\,dP(y_i). \]



For $a > 0$ we define a function $\psi_a(x)$ by

\[ \psi_a(x) = \begin{cases} x^2/(4a), & \text{when } x \le 2a, \\ x - a, & \text{when } x \ge 2a. \end{cases} \]
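We set

\[ m_a(\nu, x) = \sum_{i \le n} \int \psi_a(d_i)\,dP \quad\text{and}\quad m_a(A, x) = \inf\{ m_a(\nu, x) : \nu(A) = 1 \}. \]

For each $a > 0$ let $L_a$ be any positive number satisfying the following inequality:

\[ \frac{2L_a(e^{1/L_a} - 1)}{1 + 2L_a} \le a. \tag{1.3} \]

The following theorem holds (see PJ).

Theorem 1 Let $a > 0$ and $L_a$ satisfy (1.3). Then for any $n$ and $A \subseteq \Omega^n$ we have

\[ \int \exp\Bigl(\frac{1}{L_a}\, m_a(A, x)\Bigr)\,dP^n(x) \le \frac{1}{P^n(A)}. \tag{1.4} \]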
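
The admissible values of $L_a$ can be checked numerically. The sketch below evaluates the left-hand side $2L(e^{1/L} - 1)/(1 + 2L)$ of the condition on $L_a$ and locates, by bisection, the threshold above which the condition holds for $a = 1$. The helper names are illustrative, and the bisection relies on this left-hand side being decreasing in $L$.

```python
import math

def psi(a, x):
    """psi_a(x): quadratic x^2/(4a) for x <= 2a, linear x - a for x >= 2a."""
    return x * x / (4 * a) if x <= 2 * a else x - a

def condition_lhs(L):
    """Left-hand side 2L(e^{1/L} - 1)/(1 + 2L) of the condition on L_a."""
    return 2 * L * (math.exp(1 / L) - 1) / (1 + 2 * L)

def smallest_L(a, lo=0.1, hi=100.0, iters=100):
    """Bisection for the root of condition_lhs(L) = a.

    condition_lhs is decreasing in L, so every L above the root
    satisfies the condition.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if condition_lhs(mid) > a:
            lo = mid
        else:
            hi = mid
    return hi

print(condition_lhs(1.12))   # slightly below 1, so L = 1.12 is admissible for a = 1
print(smallest_L(1.0))       # threshold value of L for a = 1
```

The two branches of $\psi_a$ agree at $x = 2a$, where both equal $a$, so the function is continuous.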

Below we will only use this theorem for $a = 1$ and $L_1 \approx 1.12$. Let us introduce the normalized empirical process as

\[ Z(x) = \sup_{f \in \mathcal{F}} \frac{1}{\varphi(f)} \sum_{i=1}^{n} (Pf - f(x_i)), \qquad x \in \Omega^n, \]

where $\varphi : \mathcal{F} \to (0, \infty)$ is a function such that $Z$ has a finite median $M = M(Z) < \infty$, i.e.

P(Z > M) < - and > P(Z >M + e)<^. (1.5) 

The factor $\varphi(f)$ will play the same role as $(nPf)^{1/2}$ plays in (1.1). The following theorem holds.

Theorem 2 Let $L \approx 1.12$. If (1.5) holds then for any $u > 0$,

\[ P\Bigl(\exists f \in \mathcal{F} : \sum_{i \le n} (Pf - f(x_i)) \ge M\varphi(f) + 2\sqrt{LnuPf}\Bigr) \le 2e^{-u}. \tag{1.6} \]

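Proof. The proof of the theorem repeats the proof of Theorem 2 in [5] with some minor modifications, but we give it here for completeness. Let us consider the set $A = \{Z(x) \le M\}$. Clearly, $P^n(A) \ge 1/2$. Let us fix a point $x \in \Omega^n$ and then choose $f \in \mathcal{F}$. For any point $y \in A$ we have

\[ \frac{1}{\varphi(f)} \sum_{i=1}^{n} (Pf - f(y_i)) \le M. \]

Therefore, for any probability measure $\nu$ such that $\nu(A) = 1$ we will have

\[ \frac{1}{\varphi(f)} \sum_{i \le n} (Pf - f(x_i)) - M \le \frac{1}{\varphi(f)} \int \Bigl( \sum_{i \le n} (Pf - f(x_i)) - \sum_{i \le n} (Pf - f(y_i)) \Bigr)\,d\nu(y) = \frac{1}{\varphi(f)} \sum_{i \le n} \int (f(y_i) - f(x_i))\,d_i(y_i)\,dP(y_i). \]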
It is easy to observe that for $v \ge 0$ and $-1 \le u \le 1$,

\[ uv \le u^2 I(u > 0) + \psi_1(v). \tag{1.7} \]

Therefore, for any $\delta > 1$

\[ \sum_{i \le n} (Pf - f(x_i)) - M\varphi(f) \le \delta \sum_{i \le n} \int \frac{f(y_i) - f(x_i)}{\delta}\,d_i(y_i)\,dP(y_i) \]
\[ \le \frac{1}{\delta} \sum_{i \le n} \int (f(y_i) - f(x_i))^2 I(f(y_i) > f(x_i))\,dP(y_i) + \delta \sum_{i \le n} \int \psi_1(d_i)\,dP. \]

Taking the infimum over $\nu$ we obtain that for any $\delta > 1$

\[ \sum_{i \le n} (Pf - f(x_i)) \le M\varphi(f) + \frac{1}{\delta} \sum_{i \le n} \int (f(y_i) - f(x_i))^2 I(f(y_i) > f(x_i))\,dP(y_i) + \delta\, m_1(A, x). \]


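Let us denote the random variable $\xi = f(y_1)$, by $F_\xi(t)$ the distribution function of $\xi$, and $c_i = f(x_i)$. For $c \in [0,1]$ define the function $h(c)$ as

\[ h(c) = \int (f(y_1) - c)^2 I(f(y_1) > c)\,dP(y_1) = \int_c^1 (t - c)^2\,dF_\xi(t). \]

One can check that $h(c)$ is decreasing and convex, with $h(0) = Pf^2$ and $h(1) = 0$. Therefore,

\[ \frac{1}{n} \sum_{i \le n} h(c_i) \le \Bigl(\frac{1}{n} \sum_{i \le n} c_i\Bigr) h(1) + \Bigl(1 - \frac{1}{n} \sum_{i \le n} c_i\Bigr) h(0) = (1 - \bar{f})\,Pf^2. \]

Since $Pf^2 \le Pf$ for $[0,1]$-valued $f$, we showed that

\[ \sum_{i \le n} (Pf - f(x_i)) \le M\varphi(f) + \frac{1}{\delta}\, nPf + \delta\, m_1(A, x). \]
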

Theorem 1 then implies via the application of Chebyshev's inequality that with probability at least $1 - 2e^{-u}$, $m_1(A, x) \le Lu$ and, hence,

\[ \sum_{i \le n} (Pf - f(x_i)) \le M\varphi(f) + \inf_{\delta > 1} \Bigl( \frac{1}{\delta}\, nPf + \delta Lu \Bigr). \]

For $u \le nPf/L$ the infimum over $\delta > 1$ equals $2\sqrt{LnuPf}$. On the other hand, for $u \ge nPf/L$ this infimum is greater than $2nPf$, whereas the left-hand side is always less than $nPf$.

□
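
Two elementary facts used in the proof can be sanity-checked numerically: the inequality $uv \le u^2 I(u > 0) + \psi_1(v)$ for $v \ge 0$, $-1 \le u \le 1$, and the value $2\sqrt{LnuPf}$ of the infimum of $nPf/\delta + \delta Lu$ over $\delta > 1$ when $u \le nPf/L$. The grid sizes and the sample values of $nPf$, $L$ and $u$ below are illustrative assumptions; a grid check is of course not a proof.

```python
import math

def psi1(v):
    """psi_1(v) = v^2/4 for v <= 2 and v - 1 for v >= 2."""
    return v * v / 4 if v <= 2 else v - 1

def worst_gap(steps=200, v_max=6.0):
    """Max over a grid of u*v - u^2*I(u > 0) - psi_1(v); the inequality says it is <= 0."""
    worst = -math.inf
    for i in range(steps + 1):
        u = -1 + 2 * i / steps            # u ranges over [-1, 1]
        for j in range(steps + 1):
            v = v_max * j / steps         # v ranges over [0, v_max]
            gap = u * v - (u * u if u > 0 else 0.0) - psi1(v)
            worst = max(worst, gap)
    return worst

def inf_over_delta(n_pf, l_u, grid=20000, delta_max=50.0):
    """Grid approximation of the infimum over delta in (1, delta_max] of n_pf/delta + delta*l_u."""
    best = math.inf
    for k in range(1, grid + 1):
        delta = 1 + (delta_max - 1) * k / grid
        best = min(best, n_pf / delta + delta * l_u)
    return best

n_pf, L, u = 50.0, 1.12, 3.0              # hypothetical values with u <= n_pf / L
print(worst_gap())
print(inf_over_delta(n_pf, L * u), 2 * math.sqrt(L * n_pf * u))
```

When $u \ge nPf/L$ the minimizing $\delta$ would fall below 1, so the infimum over $\delta > 1$ is approached at $\delta = 1$ and is at least $nPf + Lu \ge 2nPf$.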

We will now give two examples of the normalization $\varphi(f)$ for which we can prove that (1.5) holds.
1.1 Uniform entropy conditions.

Given a probability distribution $Q$ on $\Omega$ we denote by

\[ d_{Q,2}(f,g) = (Q(f-g)^2)^{1/2} \]

an $L_2$-distance on $\mathcal{F}$ with respect to $Q$. Given $u > 0$ we say that a subset $\mathcal{F}' \subseteq \mathcal{F}$ is $u$-separated if for any $f \ne g \in \mathcal{F}'$ we have $d_{Q,2}(f,g) > u$. Let the packing number $D(\mathcal{F}, u, L_2(Q))$ be the maximal cardinality of any $u$-separated set. We will say that $\mathcal{F}$ satisfies the uniform entropy condition if

\[ \int_0^{\infty} \sqrt{\log D(\mathcal{F}, u)}\,du < \infty, \tag{1.8} \]

where

\[ \sup_{Q} D(\mathcal{F}, u, L_2(Q)) \le D(\mathcal{F}, u) \]

and the supremum is taken over all discrete probability measures. It is well known (see, for example, jS]) that if one considers the subset $\mathcal{F}_p = \{f \in \mathcal{F} : Pf \le p\}$, then the expectation of $\sup_{\mathcal{F}_p} \sum (Pf - f(x_i))$ can be estimated (in some sense, since the symmetrization argument is required) by

\[ \varphi(p) = \sqrt{n} \int_0^{\sqrt{p}} \sqrt{\log D(\mathcal{F}, u)}\,du. \tag{1.9} \]

We will prove that this estimate holds for all $p > 0$ simultaneously.

Theorem 3 Assume that $D(\mathcal{F}, 1) \ge 2$ and (1.8) holds. If $\varphi$ is defined by (1.9) then the median

\[ M = M\Bigl( \sup_{f \in \mathcal{F}} \frac{1}{\varphi(Pf)} \sum_{i=1}^{n} (Pf - f(x_i)) \Bigr) \le K < \infty \]

is finite, where $K$ is an absolute constant.

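Proof. The proof is based on standard symmetrization and chaining techniques. We will first prove that

\[ P\Bigl( \sup_{f \in \mathcal{F}} \frac{\sum (Pf - f(x_i))}{\varphi(Pf)} \ge u \Bigr) \le 2P\Bigl( \sup_{f \in \mathcal{F}} \frac{\sum (f(y_i) - f(x_i))}{\varphi(\bar{f}(x,y))} \ge u - \Bigl(\frac{2}{\log 2}\Bigr)^{1/2} \Bigr), \tag{1.10} \]

where

\[ \bar{f}(x,y) = \frac{1}{2n} \sum_{i=1}^{n} (f(y_i) + f(x_i)). \]

Let

\[ A = \Bigl\{ x \in \Omega^n : \sup_{f \in \mathcal{F}} \frac{\sum (Pf - f(x_i))}{\varphi(Pf)} \ge u \Bigr\}. \]

Let $x \in A$ and $f \in \mathcal{F}$ be such that $\sum (Pf - f(x_i))/\varphi(Pf) \ge u$. Chebyshev's inequality implies

\[ P\Bigl( \Bigl| \sum_{i=1}^{n} (Pf - f(y_i)) \Bigr| \ge \sqrt{2nPf} \Bigr) \le \frac{n \operatorname{Var} f}{2nPf} \le \frac{1}{2}, \]

where $y = (y_1, \dots, y_n)$ lives on an independent copy of $(\Omega^n, P^n)$. We will show that the inequalities

\[ nPf \le \sum f(y_i) + \sqrt{2nPf}, \qquad u \le \frac{\sum (Pf - f(x_i))}{\varphi(Pf)} \]

imply that

\[ \frac{\sum (f(y_i) - f(x_i))}{\varphi(\bar{f}(x,y))} \ge u - \Bigl(\frac{2}{\log 2}\Bigr)^{1/2}. \]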

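If we denote by $P_y$ the probability measure on the space of $y$, it would mean that

\[ \frac{1}{2} I(x \in A) \le P_y\Bigl( \Bigl| \sum (Pf - f(y_i)) \Bigr| \le \sqrt{2nPf} \Bigr) I(x \in A) \le P_y\Bigl( \sup_{f \in \mathcal{F}} \frac{\sum (f(y_i) - f(x_i))}{\varphi(\bar{f}(x,y))} \ge u - \Bigl(\frac{2}{\log 2}\Bigr)^{1/2} \Bigr), \]

and taking the expectation of both sides with respect to $x$ would prove (1.10). To show the remaining implication we consider two cases: $nPf \le \sum f(y_i)$ and $nPf \ge \sum f(y_i)$. First assume that $nPf \le \sum f(y_i)$. Since, as is easily checked, both $\varphi(p)$ and $p/\varphi(p)$ are increasing, we get

\[ \frac{\sum (Pf - f(x_i))}{\varphi(Pf)} \le \frac{\sum (f(y_i) - f(x_i))}{\varphi(n^{-1} \sum f(y_i))} \le \frac{\sum (f(y_i) - f(x_i))}{\varphi(\bar{f}(x,y))}. \]

In the case $nPf \ge \sum f(y_i)$ we have

\[ \frac{\sum (Pf - f(x_i))}{\varphi(Pf)} \le \frac{\sum (f(y_i) - f(x_i))}{\varphi(Pf)} + \frac{\sqrt{2nPf}}{\varphi(Pf)} \le \frac{\sum (f(y_i) - f(x_i))}{\varphi(\bar{f}(x,y))} + \frac{\sqrt{2nPf}}{\varphi(Pf)}. \]

The assumption $D(\mathcal{F}, \sqrt{Pf}) \ge D(\mathcal{F}, 1) \ge 2$ guarantees that $\varphi(Pf) \ge \sqrt{nPf \log 2}$ and, finally,

\[ u \le \frac{\sum (f(y_i) - f(x_i))}{\varphi(\bar{f}(x,y))} + \Bigl(\frac{2}{\log 2}\Bigr)^{1/2}, \]

which completes the proof of (1.10). We have

\[ P\Bigl( \sup_{f \in \mathcal{F}} \frac{\sum (f(y_i) - f(x_i))}{\varphi(\bar{f}(x,y))} \ge u \Bigr) = \mathbb{E} P_\varepsilon\Bigl( \sup_{f \in \mathcal{F}} \frac{\sum \varepsilon_i (f(y_i) - f(x_i))}{\varphi(\bar{f}(x,y))} \ge u \Bigr), \]

where $(\varepsilon_i)$ is a sequence of Rademacher random variables. We will show that there exists $u$ independent of $n$ such that for any $x, y \in \Omega^n$

\[ P_\varepsilon\Bigl( \sup_{f \in \mathcal{F}} \frac{\sum \varepsilon_i (f(y_i) - f(x_i))}{\varphi(\bar{f}(x,y))} \ge u \Bigr) \le \frac{1}{2}. \]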
Clearly, this will prove the statement of the theorem. For fixed $x, y \in \Omega^n$ let

\[ F = \{ (f(x_1), \dots, f(x_n), f(y_1), \dots, f(y_n)) : f \in \mathcal{F} \} \subset \mathbb{R}^{2n} \]

and

\[ d(f,g) = \Bigl( \frac{1}{2n} \sum_{i=1}^{2n} (f_i - g_i)^2 \Bigr)^{1/2}, \qquad f, g \in F. \]

The packing number of $F$ with respect to $d$ can be bounded by $D(F, u, d) \le D(\mathcal{F}, u)$. Consider an increasing sequence of sets

\[ \{0\} = F_0 \subseteq F_1 \subseteq F_2 \subseteq \dots \]

such that for any $g \ne h \in F_j$, $d(g,h) > 2^{-j}$ and for all $f \in F$ there exists $g \in F_j$ such that $d(f,g) \le 2^{-j}$. The cardinality of $F_j$ can be bounded by

\[ |F_j| \le D(F, 2^{-j}, d) \le D(\mathcal{F}, 2^{-j}). \]
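
A nested sequence with these two properties (points of $F_j$ are more than $2^{-j}$ apart, and every point of $F$ is within $2^{-j}$ of $F_j$) can be produced greedily, as the following sketch illustrates on a finite point set. The random point cloud, its size, and the function names are assumptions made only for illustration; the zero vector is included in the point set so that $F_0 = \{0\}$ makes sense.

```python
import math
import random

def dist(f, g):
    """Normalized Euclidean distance ((1/m) * sum (f_i - g_i)^2)^(1/2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)) / len(f))

def extend_separated(points, radius, seed):
    """Greedily extend `seed` to a maximal radius-separated subset of `points`."""
    chosen = list(seed)
    for p in points:
        if all(dist(p, q) > radius for q in chosen):
            chosen.append(p)
    return chosen

random.seed(0)
m = 8                                         # plays the role of 2n
zero = tuple([0.0] * m)
points = [zero] + [tuple(random.random() for _ in range(m)) for _ in range(300)]

levels = [[zero]]                             # F_0 = {0}
for j in range(1, 5):
    levels.append(extend_separated(points, 2.0 ** -j, levels[-1]))

print([len(F_j) for F_j in levels])
```

Maximality of the greedy set gives the covering property for free: a point that was not selected must lie within the radius of some selected point. Seeding level $j$ with level $j-1$ keeps the sequence increasing, because a $2^{-j+1}$-separated set is also $2^{-j}$-separated.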

For simplicity of notation we will write $D(u) := D(\mathcal{F}, u)$. If $D(2^{-j}) = D(2^{-j-1})$ then in the construction of the sequence $(F_j)$ we will set $F_j$ equal to $F_{j+1}$. We will now define the sequence of projections $\pi_j : F \to F_j$, $j \ge 0$, in the following way. If $f \in F$ is such that $d(f, 0) \in (2^{-j-1}, 2^{-j}]$ then set $\pi_0(f) = \dots = \pi_j(f) = 0$ and for $k \ge j+1$ choose $\pi_k(f) \in F_k$ such that $d(f, \pi_k(f)) \le 2^{-k}$. In the case when $F_k = F_{k+1}$ we will choose $\pi_k(f) = \pi_{k+1}(f)$. This construction implies that $d(\pi_{k-1}(f), \pi_k(f)) \le 2^{-k+2}$. Let us introduce a sequence of sets

\[ \Delta_j = \{ g - h : g \in F_j,\ h \in F_{j-1},\ d(g,h) \le 2^{-j+2} \}, \qquad j \ge 1, \]

and let $\Delta_j = \{0\}$ if $D(2^{-j}) = D(2^{-j+1})$. The cardinality of $\Delta_j$ does not exceed

\[ |\Delta_j| \le |F_j|^2 \le D(2^{-j})^2. \]

By construction any $f \in F$ can be represented as a sum of elements from $\Delta_j$,

\[ f = \sum_{j \ge 1} (\pi_j(f) - \pi_{j-1}(f)), \qquad \pi_j(f) - \pi_{j-1}(f) \in \Delta_j. \]

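Let

\[ I_j = \sqrt{n} \int_{2^{-j-1}}^{2^{-j}} \sqrt{\log D(v)}\,dv \]

and define the event

\[ A = \bigcup_{j=1}^{\infty} \Bigl\{ \sup_{f \in \Delta_j} \sum_{i=1}^{n} \varepsilon_i (f_{i+n} - f_i) \ge u I_j \Bigr\}. \]

On the complement $A^c$ of the event $A$ we have for any $f \in F$ such that $d(f, 0) \in (2^{-j-1}, 2^{-j}]$

\[ \sum_{i=1}^{n} \varepsilon_i (f_{i+n} - f_i) = \sum_{k \ge j+1} \sum_{i=1}^{n} \varepsilon_i \bigl( (\pi_k(f) - \pi_{k-1}(f))_{i+n} - (\pi_k(f) - \pi_{k-1}(f))_i \bigr) \]
\[ \le u \sum_{k \ge j+1} I_k = u \sqrt{n} \int_0^{2^{-j-1}} \sqrt{\log D(v)}\,dv \le u \sqrt{n} \int_0^{(\bar{f})^{1/2}} \sqrt{\log D(v)}\,dv = u\,\varphi(\bar{f}), \]

where $\bar{f} = (2n)^{-1} \sum_{i \le 2n} f_i$, since $2^{-j-1} < d(f, 0) \le (\bar{f})^{1/2}$. It remains to prove that for some absolute constant $u$, $P(A) \le 1/2$. Indeed,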

oo n 
P{A) < Y>(SU P YW/ <+n ~ fi) > Ul 3 ) 

oo 2 j2 

< ^|A,|exp{-^,}/(D(2-) > D(2-^)) 
i=i 

3=1 

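since for $f \in \Delta_j$

\[ \sum_{i=1}^{n} (f_{i+n} - f_i)^2 \le 2 \sum_{i=1}^{2n} f_i^2 \le 4n \cdot 2^{-2j+4}. \]

The fact that $D(u)$ is decreasing implies

\[ I_j \ge \sqrt{n}\, 2^{-j-1} \sqrt{\log D(2^{-j})} \]

and, therefore,

\[ P(A) \le \sum_{j=1}^{\infty} \exp\{ -\log D(2^{-j}) (u^2 2^{-8} - 2) \}\, I(D(2^{-j}) > D(2^{-j+1})) \]
\[ = \sum_{j=1}^{\infty} \frac{1}{D(2^{-j})^{\alpha}}\, I(D(2^{-j}) > D(2^{-j+1})) \le \sum_{j=2}^{\infty} \frac{1}{j^{\alpha}} \le \frac{1}{2} \]

for $\alpha = u^2/2^8 - 2$ big enough.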



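Combining Theorem 2 and Theorem 3 we get

Corollary 1 If (1.8) holds then there exists an absolute constant $K > 0$ such that for any $u > 0$, with probability at least $1 - 2e^{-u}$, for all $f \in \mathcal{F}$

\[ \sum_{i=1}^{n} (Pf - f(x_i)) \le K\Bigl( \sqrt{n} \int_0^{\sqrt{Pf}} \sqrt{\log D(\mathcal{F}, v)}\,dv + \sqrt{nuPf} \Bigr). \]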
i=l n 



1.2 Bracketing entropy conditions. 

Given two functions $g, h : \Omega \to [0,1]$ such that $g \le h$ and $(P(h-g)^2)^{1/2} \le u$, we will call the set of all functions $f$ such that $g \le f \le h$ a $u$-bracket with respect to $L_2(P)$. The $u$-bracketing number $N_{[\,]}(\mathcal{F}, u, L_2(P))$ is the minimum number of $u$-brackets needed to cover $\mathcal{F}$. Assume that

\[ \int_0^{\infty} \sqrt{\log N_{[\,]}(\mathcal{F}, u, L_2(P))}\,du < \infty \tag{1.11} \]

and denote

\[ \varphi(p) = \sqrt{n} \int_0^{\sqrt{p}} \sqrt{\log N_{[\,]}(\mathcal{F}, u, L_2(P))}\,du. \]

Then the following theorem holds. 

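Theorem 4 Assume that $N_{[\,]}(\mathcal{F}, 1, L_2(P)) \ge 2$ and (1.11) holds. If $\varphi$ is defined as above then the median

\[ M = M\Bigl( \sup_{f \in \mathcal{F}} \frac{1}{\varphi(Pf)} \sum_{i=1}^{n} (Pf - f(x_i)) \Bigr) \le K(\mathcal{F}) < \infty, \]

where $K(\mathcal{F})$ does not depend on $n$.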

We omit the proof of this theorem since it is a modification of a standard bracketing entropy bound (see Theorems 2.5.6 and 2.14.2 in [T^]), similar to what Theorem 3 is to the standard uniform entropy bound. The argument is more subtle as it involves a truncation argument required by the application of Bernstein's inequality, but otherwise it repeats Theorem 3. Combining Theorem 2 and Theorem 4 we get

Corollary 2 If (1.11) holds then there exists an absolute constant $K > 0$ such that for any $u > 0$, with probability at least $1 - 2e^{-u}$, for all $f \in \mathcal{F}$

\[ \sum_{i=1}^{n} (Pf - f(x_i)) \le K\Bigl( \sqrt{n} \int_0^{\sqrt{Pf}} \sqrt{\log N_{[\,]}(\mathcal{F}, v, L_2(P))}\,dv + \sqrt{nuPf} \Bigr). \]

2 Examples of application. 

Example 1 (VC-subgraph classes of functions). A class of functions $\mathcal{F}$ is called VC-subgraph if the class of sets

\[ \mathcal{C} = \bigl\{ \{(\omega, t) : \omega \in \Omega,\ t \in \mathbb{R},\ t \le f(\omega)\} : f \in \mathcal{F} \bigr\} \]

is a VC class of sets in $\Omega \times \mathbb{R}$. The VC dimension of $\mathcal{F}$ is equal to the VC dimension $d$ of $\mathcal{C}$. One can use Corollary 3 in jl] to show that

\[ D(\mathcal{F}, u) \le e(d+1)\Bigl(\frac{2e}{u^2}\Bigr)^{d}. \]

Corollary 1 implies in this case that for any $\delta > 0$, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$

\[ \frac{1}{n(Pf)^{1/2}} \sum_{i=1}^{n} (Pf - f(x_i)) \le K\Bigl( \Bigl(\frac{d}{n}\log n\Bigr)^{1/2} + \Bigl(\frac{1}{n}\log\frac{1}{\delta}\Bigr)^{1/2} \Bigr), \tag{2.1} \]

where $K > 0$ is an absolute constant. Instead of the $\log n$ on the right-hand side of (2.1) one could also write $\log(1/Pf)$, but we simplify the bound to eliminate this dependence on $Pf$. Note that the bound is similar to the bound (1.2) for VC classes of sets and VC-major classes. Unfortunately, our proof does not allow us to recover the same small value of $K = 2$ as for VC classes of sets.

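(2.1) improves the main result in [7], where it was shown that for any fixed $\nu > 0$ and any $\delta > 0$, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$

\[ \frac{\frac{1}{n}\sum_{i=1}^{n} (Pf - f(x_i))}{Pf + \bar{f} + \nu} \le K\Bigl( \frac{1}{n\nu}\Bigl( d\log\frac{1}{\nu} + \log\frac{1}{\delta} \Bigr) \Bigr)^{1/2}. \tag{2.2} \]

It is easy to see that, in a sense, one would get (2.1) from (2.2) only after optimizing over $\nu$. Indeed, for $Pf \le \nu$, (2.2) gives

\[ \frac{1}{n}\sum_{i=1}^{n} (Pf - f(x_i)) \le K\Bigl( \frac{\nu}{n}\Bigl( d\log\frac{1}{\nu} + \log\frac{1}{\delta} \Bigr) \Bigr)^{1/2}, \]

which is implied by (2.1) as well. For $Pf \ge \nu$, (2.2) gives

\[ \frac{1}{n(Pf)^{1/2}}\sum_{i=1}^{n} (Pf - f(x_i)) \le K\Bigl(\frac{Pf}{\nu}\Bigr)^{1/2}\Bigl( \frac{1}{n}\Bigl( d\log\frac{1}{\nu} + \log\frac{1}{\delta} \Bigr) \Bigr)^{1/2}, \]

which compared to (2.1) contains an additional factor of $(Pf/\nu)^{1/2}$. In the situation when $\nu$ is small (this is the only interesting case) this factor introduces an unnecessary penalty for any function $f$ such that $Pf \gg \nu$. Hence, for a fixed $\nu$, (2.2) improves the bound for $Pf \le \nu$ at the cost of $f$ with $Pf \ge \nu$.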

One can find alternative extensions of (2.2) in [5]. For some other applications of Corollary 1 see |TU].

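Example 2 (Bracketing entropy). Assume that either

\[ \log D(\mathcal{F}, u) \le c u^{-\gamma} \quad\text{or}\quad \log N_{[\,]}(\mathcal{F}, u, L_2(P)) \le c u^{-\gamma}, \qquad \gamma \in (0, 2). \]

Then Corollary 1 or Corollary 2 implies that for $u > 0$, with probability at least $1 - 2e^{-u}$, for all $f \in \mathcal{F}$

\[ Pf - \bar{f} \le \frac{K}{\sqrt{n}}\Bigl( (Pf)^{\frac{1}{2} - \frac{\gamma}{4}} + (uPf)^{1/2} \Bigr). \]

If $\bar{f} = 0$ then it is easy to see that for $u \le n^{\gamma/(2+\gamma)}$ we have

\[ Pf \le K_\gamma\, n^{-\frac{2}{2+\gamma}}. \]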
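
The rate $n^{-2/(2+\gamma)}$ in the zero-error case can be observed numerically by solving $p = (K/\sqrt{n})(p^{1/2 - \gamma/4} + (up)^{1/2})$ for $p$ and measuring how the solution scales with $n$. The sketch below takes $K = 1$ and $u = 1$ as placeholder values (the constant in the bound is not specified) and uses hypothetical function names; the measured slope only approaches the asymptotic exponent as $n$ grows.

```python
import math

def solve_pf(n, gamma, u=1.0, K=1.0, iters=200):
    """Largest p in (0, 1] with p <= (K/sqrt(n)) * (p**(1/2 - gamma/4) + sqrt(u*p)).

    Bisection on the sign of p minus the bound; assumes K*(1 + sqrt(u)) < sqrt(n),
    so that the inequality fails at p = 1.
    """
    def excess(p):
        return p - K / math.sqrt(n) * (p ** (0.5 - gamma / 4) + math.sqrt(u * p))
    lo, hi = 1e-30, 1.0
    for _ in range(iters):
        mid = math.sqrt(lo * hi)          # bisection on a logarithmic scale
        if excess(mid) <= 0:
            lo = mid
        else:
            hi = mid
    return lo

def empirical_exponent(gamma, n1=10**4, n2=10**8):
    """Slope of log(p) against log(n) between two sample sizes."""
    p1, p2 = solve_pf(n1, gamma), solve_pf(n2, gamma)
    return math.log(p2 / p1) / math.log(n2 / n1)

for gamma in (0.5, 1.0, 1.5):
    print(gamma, empirical_exponent(gamma), -2 / (2 + gamma))
```

For small $p$ the term $p^{1/2 - \gamma/4}$ dominates $(up)^{1/2}$, which is why the entropy exponent $\gamma$, and not the confidence parameter $u$, determines the rate in this regime.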

As an example, if $\mathcal{F}$ is a class of indicator functions for sets with $\alpha$-smooth boundary in $[0,1]^l$ and $P$ is Lebesgue absolutely continuous with bounded density, then well-known bounds on the bracketing entropy due to Dudley (see [3]) imply that $\gamma = 2(l-1)/\alpha$ and $Pf \le K_\alpha\, n^{-\alpha/(l-1+\alpha)}$. Even though $\gamma = 2(l-1)/\alpha$ may be greater than 2 and Corollary 2 is not immediately applicable, one can generalize Theorem 4 to different choices of $\varphi(x)$, using the standard truncation in the chaining argument, to obtain the above rates even for $\gamma \ge 2$.

Acknowledgments. We would like to thank the referee for several very helpful comments and suggestions.